Sima Taparia, a matchmaker from Mumbai and the lead of Netflix’s latest show, Indian Matchmaking, has affected many. Some are whimpering and booing her aunty-gaze; others are nesting in the tales of how their own match was designed. A day in the life of Sima is no less than that of a data scientist: working with probability and estimating likelihood. She deploys her mind to optimise for a match. Sima’s work-day involves visiting the homes of her clients, collecting data on their lifestyle and collating a list of their preferences. She works relentlessly on analysing unstructured data such as the demand for a “flexible” and “adjusting” bride. And for quantifiable variables such as “age” and “height”, she maintains a ledger, tracking the metrics of every person. Sima is a match() function in flesh and blood. If her work of crunching complex data into concrete traits were a comic strip, it could well be taglined “Aunty Taparia’s brain runs faster than a computer!”
The accuracy of Indian Matchmaking is a comment on how matrimonial services are run in the country, even today. Despite dating platforms and matrimonial websites, marriage bureaus still exist, and print ads continue to occupy expensive real estate in newspapers. These ads, 3 x 5 cm in format, are packed with a laundry list of criteria. Each word, chosen and paid for, forms what could be called the shortest pitch for a life-long commitment.
Inspired by the show, I picked up last Sunday’s paper and decided to pay attention to the ads. On that day (19 July, 2020), the paper published 95 matrimonial ads (link to data). All ads were heteronormative, assuming heterosexuality as the norm: 32 men seeking brides and 63 women seeking grooms. The match-seekers ranged in age from 24 to 50 years.
The daily used for this exercise is an English-language newspaper focused on a tier-I city. Either the newspaper arranges ads by the caste and religion of the match-seeker, or the ads mention them in their preferences. In this sample, only 3 ads were silent on caste and religion. Between 30 and 40 percent of the ads for both genders were from the upper-caste groups of Brahmins, Jains and Agarwals.
Along with caste, religion, age and height, a typical ad shares details on a few more set features, such as the match-seeker’s education or profession and income. Information on the father’s occupation is a recurring feature. So is skin color, which ranges from “wheatish” to “very-fair”. By most accounts, families are looking within the same sect, but for a brighter gene.
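To put those counts in proportion, here is a minimal sketch that uses only the figures quoted above. The variable names and the share() helper are my own shorthand for illustration; they are not part of the plotting script that follows.

```r
# Counts as reported in the text for the 19 July 2020 sample
ads_total    <- 95
ads_by_men   <- 32  # men seeking brides
ads_by_women <- 63  # women seeking grooms
ads_silent   <- 3   # ads silent on caste and religion

# Sanity check: the two groups should add up to the full sample
stopifnot(ads_by_men + ads_by_women == ads_total)

# share() is a hypothetical helper: percentage of the sample, one decimal
share <- function(n) round(100 * n / ads_total, 1)

share(ads_by_men)               # 33.7
share(ads_by_women)             # 66.3
share(ads_total - ads_silent)   # 96.8, ads that mention caste or religion
```

Roughly two in three ads were placed for women, and all but a handful state caste or religion.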
# Description plot
library(stringr)
library(ggplot2)
library(grid)
library(plotly)
set.seed(122)
# Reading the dataset
data <- read.csv("./data/matrimoney.csv", header = TRUE, stringsAsFactors = FALSE)
# Function for adjectives, use case: x="ad posted by", for self = TRUE/FALSE, and match = TRUE/FALSE
find_adj <- function(x, self, match){
  person <- subset(data, Ad.by == x)
  # Drop punctuation that is not followed by a word character, then split on spaces
  if(self == TRUE){
    adj <- unlist(strsplit(gsub('[[:punct:]](?!\\w)', '',
                                person$Self.description, perl = TRUE), ' '))
  }else if(match == TRUE){
    adj <- unlist(strsplit(gsub('[[:punct:]](?!\\w)', '',
                                person$Partner.preference,
                                perl = TRUE), ' '))
  }
  adj <- gsub("-family", "", adj)
  # Frequency table of lower-cased words, most frequent first
  adj <- data.frame(table(tolower(adj)))
  adj <- adj[order(-adj$Freq),]
  adj
}
# Creating four datasets for bride/groom, using words described for self/match. Here: g stands for groom (not girl) and b for bride (not boy)
g.adj <- find_adj(x = "Boy", self = TRUE, match = FALSE)
g.match <- find_adj(x = "Boy", self = FALSE, match = TRUE)
b.adj <- find_adj(x = "Girl", self = TRUE, match = FALSE)
b.match <- find_adj(x = "Girl", self = FALSE, match = TRUE)
# Adding scatter
g.adj$dot.x <- rnorm(n=nrow(g.adj), 12, 3)
g.adj$dot.y <- rnorm(n=nrow(g.adj), 35, 3)
g.match$dot.x <- rnorm(n=nrow(g.match), 13, 4)
g.match$dot.y <- rnorm(n=nrow(g.match), 14, 3)
b.adj$dot.x <- rnorm(n=nrow(b.adj), 37, 4)
b.adj$dot.y <- rnorm(n=nrow(b.adj), 35, 4)
b.match$dot.x <- rnorm(n=nrow(b.match), 38, 4)
b.match$dot.y <- rnorm(n=nrow(b.match), 14, 5)
# Putting it all in a single dataframe
g.adj$cat <- "g.adj"
g.match$cat <- "g.match"
b.adj$cat <- "b.adj"
b.match$cat <- "b.match"
two <- rbind(g.adj, g.match, b.adj, b.match)
# Text labels: keep only the first 5 and last 4 words of each (frequency-sorted) category
two <- do.call(rbind, lapply(unique(two$cat), function(t){
  xyz <- subset(two, cat == t)
  top <- head(xyz, 5)
  bottom <- tail(xyz, 4)
  topbottom <- c(as.character(top$Var1),
                 as.character(bottom$Var1))
  xyz$labs <- ifelse(xyz$Var1 %in% topbottom,
                     as.character(xyz$Var1), NA)
  xyz
}))
################# Plot code ####################################
abc <- ggplot(two) +
  # Background panels: bride ads on the right, groom ads on the left
  geom_rect(aes(xmin = 26, xmax = 49, ymin = 2, ymax = 48),
            fill = "lightblue1") +
  geom_rect(aes(xmin = 2, xmax = 24, ymin = 2, ymax = 48),
            fill = "#F4F3AB") +
  # Dividers that split the canvas into four quadrants
  geom_hline(yintercept=25, color = "orchid4", size=2.5) +
  geom_vline(xintercept=25, color = "orchid4", size=2.5) +
  # Small boxes that hold the axis captions
  geom_rect(aes(xmin = 23, xmax = 27, ymin = 44, ymax = 48),
            fill = "orchid4") +
  geom_rect(aes(xmin = 2, xmax = 6, ymin = 23, ymax = 27),
            fill = "tan1") +
  geom_rect(aes(xmin = 45, xmax = 49, ymin = 23, ymax = 27),
            fill = "tan1") +
  geom_rect(aes(xmin = 23, xmax = 27, ymin = 2, ymax = 6),
            fill = "orchid4") +
  # theme_void() comes first: as a complete theme it would otherwise
  # reset the theme() settings that follow
  theme_void() +
  theme(
    axis.ticks.x=element_blank(),
    axis.text.x=element_blank(),
    axis.ticks.y=element_blank(),
    axis.text.y=element_blank(),
    plot.margin=unit(c(2,1,1,1),"cm")
  ) +
  scale_x_continuous(expand = c(0, 0), limits = c(0, 50)) +
  scale_y_continuous(expand = c(0, 2), limits = c(0, 50)) +
  # One dot per word, sized by frequency; `text` feeds the plotly tooltip
  geom_point(aes(dot.x, dot.y,
                 text = Var1),
             colour = "thistle3", size = two$Freq) +
  geom_text(aes(dot.x, dot.y, label=labs),colour="lightsalmon4",
            size=4, parse = TRUE) +
  annotate("text", x=25, y=46, label="For\nself",
           colour="white",
           size=4) +
  annotate("text", x=25, y=4, label="For\nmatch",
           colour="white",
           size=4) +
  annotate("text", x=4, y=25, label="Ad by\ngroom",
           colour="black",
           size=4) +
  annotate("text", x=47, y=25, label="Ad by\nbride",
           colour="black",
           size=4) +
  annotate("text", x = 42, y = 0,
           label = "As used in ads printed on July 19\nin a leading daily",
           size = 4,
           colour = "black") +
  annotate("text", x = 7, y = 50,
           label = "Words used:",
           size = 6,
           colour = "black")
# Minor plotly fixes
ax <- list(
  zeroline = FALSE,
  showline = FALSE,
  showticklabels = FALSE,
  showgrid = FALSE
)
p <- ggplotly(abc, tooltip = "text") %>% layout(xaxis = ax, yaxis = ax) %>%
  layout(hoverlabel = list(bgcolor= 'indianred2'),
         font = list(
           family = "Agency FB",
           size = 30,
           color = '#ffffff'))
p
Over 200 adjectives are sprinkled across this sample of 95 ads. The matrimonial print-dictionary is hard to decipher. PQM, meaning “Professionally-Qualified-Match” and SM4 i.e. “Suitable-Match-For” are conversation starters. Each letter costs, so SMS lingo works. Vowels are needless when “b’ful, cltrd, stld” can be understood as “beautiful, cultured, settled” and not “bountiful, cluttered and startled”. An “NM” could excite you thinking they want “nothing much” out of you, but “Non-Manglik” is serious signalling, as is “I’less” i.e. “issueless” that normally precedes a widow or a divorcee. This vocabulary makes it tougher to scale data-collection of matrimonial ads using machines. But it saves money and gets the job done.
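To illustrate why this shorthand is awkward for machines, here is a minimal sketch of a decoder. It assumes a hand-built lookup table containing only the abbreviations quoted above; the `abbreviations` vector and the `expand_ad()` helper are hypothetical and were not used for the plot.

```r
# Hypothetical lookup table built only from the abbreviations quoted above;
# a real decoder would need a far longer, hand-curated list.
abbreviations <- c(
  "pqm"    = "professionally-qualified-match",
  "sm4"    = "suitable-match-for",
  "nm"     = "non-manglik",
  "b'ful"  = "beautiful",
  "cltrd"  = "cultured",
  "stld"   = "settled",
  "i'less" = "issueless"
)

# expand_ad() is an illustrative helper, not part of the plotting code.
expand_ad <- function(text) {
  # Normalise curly apostrophes so both spellings of "b'ful" hit the same key
  text <- gsub("\u2019", "'", text)
  # Split on whitespace and commas, then lower-case each token
  tokens <- tolower(unlist(strsplit(text, "[[:space:],]+")))
  tokens <- tokens[tokens != ""]
  # Replace known abbreviations; leave every other token untouched
  known <- tokens %in% names(abbreviations)
  tokens[known] <- abbreviations[tokens[known]]
  unname(tokens)
}

expand_ad("B'ful, cltrd NM girl seeks PQM")
```

Every abbreviation missing from the table passes through unchanged, which is exactly the scaling problem: each new ad can introduce shorthand the table has never seen.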
There are also some very specific asks, such as a “fair and smart IIM graduate” looking only for an “IIM-ISB-SP-Jain-MBA graduate”, “between 28-31-years” and “5’2-to-5’7 ft height”. Or a “handsome” boy looking for a girl who is “homely”, “non-working”, with an “adjusting-nature”, and who must also be “Bsc(PCM)-Msc(Math)” only. Or a “beautiful” girl with only one criterion: finding a “central-government-civil-servant” groom.
Putting people in boxes and matching boxes of different shapes and sizes is a hard task. I hope those who are holding Sima aunty accountable for her error-rate understand that her efforts are meaningless if the stars don’t align.
Get in touch
surbhibhatia1906[at]gmail[dot]com